Dataframes

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • What is a dataframe?
  • Observations
  • Variables
  • Recoding variables
  • Scope of data

What is a dataframe?

Table

It's a generic name. It can be almost anything.

  • Periodic table
  • Multiplication table
  • Truth table
  • Chi-squared table
  • Phonetic table

Data(s)

  • Source of information (SI): Raw empirical material.
  • Data (s/p): Collected, processed, systematized and organized SI (Van Evera 2009).
    • Numbers, characters, symbols … with no meaning on their own.
  • Database: An organized collection of data stored and accessed electronically, or an organized collection of data stored as multiple datasets.
  • Dataset: A structured collection of data generally associated with a unique body of work.

Spreadsheet

How Excel stores data in two dimensions:

Dataframe

A way to store data in R in two dimensions, rows and columns:

# A tibble: 17,548 × 9
   scode country      year polity2 xrreg xrcomp xropen xconst parreg
   <chr> <chr>       <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 AFG   Afghanistan  1800      -6     3      1      1      1      3
 2 AFG   Afghanistan  1801      -6     3      1      1      1      3
 3 AFG   Afghanistan  1802      -6     3      1      1      1      3
 4 AFG   Afghanistan  1803      -6     3      1      1      1      3
 5 AFG   Afghanistan  1804      -6     3      1      1      1      3
 6 AFG   Afghanistan  1805      -6     3      1      1      1      3
 7 AFG   Afghanistan  1806      -6     3      1      1      1      3
 8 AFG   Afghanistan  1807      -6     3      1      1      1      3
 9 AFG   Afghanistan  1808      -6     3      1      1      1      3
10 AFG   Afghanistan  1809      -6     3      1      1      1      3
# … with 17,538 more rows
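
The two dimensions can be explored directly in R. The following is a minimal sketch, assuming a small base R data.frame (not the tibble printed above) rebuilt by hand from the first three rows of the printout; the object name `polity` is an assumption made for illustration.

```r
# Toy data frame copied from the first rows printed above
# (base R data.frame, not a tibble; the name `polity` is illustrative).
polity <- data.frame(
  scode   = rep("AFG", 3),
  country = rep("Afghanistan", 3),
  year    = 1800:1802,
  polity2 = rep(-6, 3)
)

dim(polity)          # number of rows, then number of columns
polity[1, ]          # one whole row (an observation)
polity$polity2       # one whole column (a variable)
polity[2, "year"]    # one cell: second row, column "year"
```

Rows are addressed before the comma and columns after it, which mirrors the rows-then-columns layout of the printout.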

A tidy dataframe

We consider that a dataframe is tidy if it fulfills the following requirements (Wickham 2014):

  • Each dataframe has one unit of observation.
  • Observations are represented in the rows.
  • Variables are represented in the columns.
  • Each cell contains a single value.
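
As an illustration of these requirements, here is a hedged sketch in base R. The table `wide` and all of its values are invented for this example; it is untidy because the variable "year" is hidden inside the column names.

```r
# Hypothetical untidy table (invented values): "year" lives in the column names.
wide <- data.frame(
  unit       = c("A", "B"),
  score_2000 = c(10, 20),
  score_2001 = c(11, 21)
)

# Tidy version: one row per unit-year, one column per variable.
tidy <- data.frame(
  unit  = rep(wide$unit, times = 2),
  year  = rep(c(2000, 2001), each = nrow(wide)),
  score = c(wide$score_2000, wide$score_2001)
)

# Sort by unit and year so the rows read naturally.
tidy <- tidy[order(tidy$unit, tidy$year), ]
rownames(tidy) <- NULL
tidy
```

The reshaping is done by hand here to keep the logic visible. In the tidyverse, `tidyr::pivot_longer()` is the usual tool for the same step.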

Observations

Observing …

We need to decide what the units of interest are.

What is an observation?

  • Unit of analysis: The thing that we want to know about.
    • Determined by the hypothesis or research question.
  • Unit of observation: Each row of a dataframe.
    • Determined by the measurement instrument.

# A tibble: 8 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
7 Afghanistan Asia       1982    39.9 12881816      978.
8 Afghanistan Asia       1987    40.8 13867957      852.
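
One way to check what the unit of observation is: find the columns that identify each row uniquely. The sketch below assumes a base R data.frame copied from the first three rows of the printout above; the object names are illustrative.

```r
# First three rows of the printout above, typed in by hand.
obs <- data.frame(
  country = rep("Afghanistan", 3),
  year    = c(1952, 1957, 1962),
  lifeExp = c(28.8, 30.3, 32.0)
)

n_observations <- nrow(obs)                   # rows = country-years
n_countries    <- length(unique(obs$country)) # distinct countries

# TRUE if no country-year combination appears twice,
# i.e. country + year identifies each row.
key_is_unique <- anyDuplicated(obs[c("country", "year")]) == 0

n_observations
n_countries
key_is_unique
```

Here the unit of observation is the country-year: three rows, but only one country. A question about countries would therefore have a different unit of analysis than the rows of this dataframe.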

Example: Macro level

States, regions, legal systems …

# A tibble: 15 × 6
   WarName            WarType CcodeA SideA          CcodeB SideB                
   <chr>                <dbl>  <dbl> <chr>           <dbl> <chr>                
 1 First Caucasus           5    365 Russia             -8 Georgians, Dhagestan…
 2 Sidon-Damascus           6     -8 Sidon              -8 Damascus & Aleppo    
 3 First Two Sicilies       4    300 Austria            -8 -8                   
 4 First Two Sicilies       4    329 Two Sicilies       -8 Liberals             
 5 Spanish Royalists        4    230 Spain              -8 Royalists            
 6 Sardinian Revolt         4    300 Austria            -8 -8                   
 7 Sardinian Revolt         4    325 Sardinia           -8 Carbonari            
 8 Greek Independence       5    640 Ottoman Empire     -8 Greeks               
 9 Greek Independence       5     -8 -8                200 United Kingdom       
10 Greek Independence       5     -8 -8                220 France               
11 Greek Independence       5     -8 -8                365 Russia               
12 Egypt-Mehdi              6     -8 Egypt              -8 Mehdi army           
13 Janissari Revolt         4    640 Ottoman Empire     -8 Janissaries          
14 Miguelite War            4     -8 -8                200 United Kingdom       
15 Miguelite War            4    235 Portugal           -8 Constitutionalists   

Intra-State War Data (Correlates of War)

Example: Meso level

Organizations, ethnic groups, political parties …

# A tibble: 14 × 5
   countryname  year groupname statusname     groupsize
   <chr>       <dbl> <chr>     <chr>              <dbl>
 1 Belgium      1967 Flemings  JUNIOR PARTNER     0.59 
 2 Belgium      1967 Walloon   SENIOR PARTNER     0.4  
 3 Belgium      1967 Germans   IRRELEVANT         0.01 
 4 France       1967 French    MONOPOLY           0.976
 5 France       1967 Basques   POWERLESS          0.013
 6 France       1967 Corsicans POWERLESS          0.004
 7 France       1967 Roma      DISCRIMINATED      0.006
 8 Belgium      1968 Flemings  JUNIOR PARTNER     0.59 
 9 Belgium      1968 Walloon   SENIOR PARTNER     0.4  
10 Belgium      1968 Germans   IRRELEVANT         0.01 
11 France       1968 French    MONOPOLY           0.976
12 France       1968 Basques   POWERLESS          0.013
13 France       1968 Corsicans POWERLESS          0.004
14 France       1968 Roma      DISCRIMINATED      0.006

International Conflict Research

Example: Micro level

Families, individuals, relationships …

# A tibble: 1,599 × 5
   age   language       urban_rural region    electricity_nearby
   <chr> <chr>          <chr>       <chr>     <chr>             
 1 26    Igbo           Urban       IMO       Yes               
 2 25    Other          Rural       FCT ABUJA Yes               
 3 35    Hausa          Rural       FCT ABUJA Yes               
 4 79    Other          Rural       FCT ABUJA Yes               
 5 19    English        Rural       FCT ABUJA Yes               
 6 34    Igbo           Urban       IMO       Yes               
 7 30    Pidgin English Rural       FCT ABUJA Yes               
 8 32    Hausa          Rural       FCT ABUJA Yes               
 9 50    Other          Rural       FCT ABUJA Yes               
10 18    English        Rural       FCT ABUJA Yes               
# … with 1,589 more rows

Example: Events

Bombings, contracts, terrorist attacks…

# A tibble: 477 × 8
   cowcode region  year country    no  coup successful combat
     <dbl>  <dbl> <dbl> <chr>   <dbl> <dbl>      <dbl>  <dbl>
 1      40      5  1952 Cuba        1     1          1      1
 2      40      5  1957 Cuba        1     1          0      1
 3      41      5  1950 Haiti       1     1          1      0
 4      41      5  1956 Haiti       1     1          0      0
 5      41      5  1957 Haiti       1     1          1      0
 6      41      5  1957 Haiti       2     1          1      0
 7      41      5  1957 Haiti       3     1          1      0
 8      41      5  1958 Haiti       1     1          0      1
 9      41      5  1970 Haiti       1     1          0      0
10      41      5  1986 Haiti       1     1          1      0
# … with 467 more rows

Coup Agency and Mechanisms Dataset
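
When each row is an event, the same country-year can appear more than once. The following sketch, assuming a base R data.frame copied from the first seven rows of the printout above, counts events to move from the event level to the country-year level; the column name `n_coups` is an assumption.

```r
# First seven rows of the printout above (only three columns kept).
coups <- data.frame(
  country = c("Cuba", "Cuba", "Haiti", "Haiti", "Haiti", "Haiti", "Haiti"),
  year    = c(1952, 1957, 1950, 1956, 1957, 1957, 1957),
  no      = c(1, 1, 1, 1, 1, 2, 3)
)

# Country + year does NOT identify a row: Haiti 1957 appears three times.
has_repeated_country_year <- anyDuplicated(coups[c("country", "year")]) > 0

# Change the unit of observation: one row per country-year, with a count.
per_year <- aggregate(no ~ country + year, data = coups, FUN = length)
names(per_year)[3] <- "n_coups"
per_year
```

Seven events collapse into five country-years, and the choice between the two layouts depends on the unit of analysis.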

Ecological fallacy

“The rich are less corrupt”.

  • Simpson’s paradox.
  • Durkheim.
  • Wealthier states tend to vote Democratic.
  • Covid in hospitals.
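
A toy illustration of how an aggregate pattern can reverse at the individual level. All numbers below are invented for this sketch, and the names `toy`, `x` and `y` are placeholders rather than anything from the examples listed above.

```r
# Invented data: six individuals in two groups.
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x     = 1:6,
  y     = c(3, 2, 1, 6, 5, 4)
)

# Aggregate level: one row per group, using group means.
group_means   <- aggregate(cbind(x, y) ~ group, data = toy, FUN = mean)
cor_aggregate <- cor(group_means$x, group_means$y)

# Individual level: correlation between x and y inside each group.
cor_within <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

cor_aggregate  # positive across groups
cor_within     # negative inside every group
```

Across groups, higher `x` goes with higher `y`. Inside each group, the relationship is the opposite. Drawing conclusions about individuals from the group-level pattern would be an ecological fallacy.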

Local elections

What if…

Elections in Barcelona

Variables

What is a variable?

A characteristic of the object we’re studying.

  • It varies across units.

Types of variables (I): Nominal

[1] "England" "France" 
[1] England France 
Levels: England France
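
Output like the above can be produced with a character vector and a factor. This is a minimal base R sketch; the object names are assumptions.

```r
# A nominal variable stored as plain text ...
countries_chr <- c("England", "France")

# ... and as a factor: a fixed set of categories with no order.
countries_fct <- factor(countries_chr)

countries_chr
countries_fct
levels(countries_fct)      # the categories
is.ordered(countries_fct)  # FALSE: categories cannot be ranked
```

For a nominal variable only equality makes sense (`==`). Comparisons such as `<` are not meaningful, and R returns `NA` with a warning for unordered factors.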

Types of variables (II): Ordinal

[1] Small Large
Levels: Small < Large
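
R sorts factor levels alphabetically unless they are given explicitly, so "Large" would come before "Small" by default. A hedged sketch of how the intended order can be stated, using illustrative object names:

```r
# Default: levels are sorted alphabetically (Large, then Small).
size_default <- factor(c("Small", "Large"), ordered = TRUE)
levels(size_default)

# Explicit: spell out the levels from lowest to highest.
size <- factor(c("Small", "Large"),
               levels  = c("Small", "Large"),
               ordered = TRUE)
levels(size)

# With an ordered factor, ranking comparisons are allowed.
size[1] < size[2]   # is "Small" lower than "Large"?
```

Setting `levels` by hand is what turns a set of labels into an ordinal scale. The distance between levels remains undefined.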

Types of variables (III): Interval

Zero is arbitrary.

[1]  3  9 13

Types of variables (IV): Ratio

Zero means absence of the measured quantity.

[1] 4.9 9.0
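
The difference between interval and ratio scales can be sketched with temperature. This example is an assumption chosen for illustration, not the data printed above: Celsius has an arbitrary zero, while Kelvin has a true zero.

```r
# Two temperatures on an interval scale (Celsius) ...
celsius <- c(10, 20)

# ... and the same temperatures on a ratio scale (Kelvin).
kelvin <- celsius + 273.15

# Differences are meaningful on both scales and agree.
diff(celsius)
diff(kelvin)

# Ratios are only meaningful when zero means "none".
ratio_celsius <- celsius[2] / celsius[1]  # 2, but 20 °C is not "twice as hot"
ratio_kelvin  <- kelvin[2] / kelvin[1]    # close to 1
ratio_celsius
ratio_kelvin
```

Shifting the zero point leaves differences untouched but changes every ratio. Statements such as "twice as much" therefore need a ratio variable.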

Recoding variables
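
A minimal sketch of one common recoding step, using three rows copied from the Correlates of War excerpt shown earlier. It assumes that -8 is a code for a missing or not-applicable value, which the slides do not state. The new column `has_code` is hypothetical.

```r
# Three rows from the intra-state war excerpt (two columns only).
wars <- data.frame(
  WarName = c("First Caucasus", "Sidon-Damascus", "First Two Sicilies"),
  CcodeA  = c(365, -8, 300)
)

# Recode the sentinel value -8 (ASSUMED to mean "no valid code") as NA,
# so that R does not treat it as a real number.
wars$CcodeA[wars$CcodeA == -8] <- NA

# Derive a new categorical variable from the recoded one.
wars$has_code <- ifelse(is.na(wars$CcodeA), "no", "yes")
wars
```

Recoding sentinel values before any analysis prevents them from entering means or counts as if they were real codes.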

References

Van Evera, Stephen. 2009. Guía para Estudiantes de Ciencia Política: Métodos y Recursos. Barcelona: Gedisa.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.